DATASET

CrimeRate - Per capita crime rate by city Indus - Proportion of non-retail business acres per city. River – River front or not (1-bounds river; 0-otherwise) AvgRoom - Average number of rooms per dwelling per city Age – Average age of a house by city Tax - full-value property-tax rate per $10,000 PTRatio - pupil-teacher ratio by city LStat - % lower status of the population by city MedPrice - Median price of owner-occupied homes in $1000’s

Load necessary libraries

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
library(leaps)
library(MASS)
## 
## Attaching package: 'MASS'
## 
## The following object is masked from 'package:dplyr':
## 
##     select
library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## 
## The following object is masked from 'package:dplyr':
## 
##     recode
## 
## The following object is masked from 'package:purrr':
## 
##     some
library(nortest)
library(gridExtra)
## 
## Attaching package: 'gridExtra'
## 
## The following object is masked from 'package:dplyr':
## 
##     combine

Ingest data

cincinnati <- read_csv("CINCINNATI.csv")
## Rows: 150 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (9): CrimeRate, Indus, River, AvgRoom, Age, Tax, PTRatio, LStat, MedPrice
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(cincinnati)
## # A tibble: 6 × 9
##   CrimeRate Indus River AvgRoom   Age   Tax PTRatio LStat MedPrice
##       <dbl> <dbl> <dbl>   <dbl> <dbl> <dbl>   <dbl> <dbl>    <dbl>
## 1    0.0273  7.07     0    7.18    61   242    17.8  4.03     34.7
## 2    0.254   9.9      0    5.70    78   304    18.4 11.5      16.2
## 3    0.0396  5.19     0    6.04    35   224    20.2  8.01     21.1
## 4    0.145  10.0      0    5.73    65   432    17.8 13.6      19.3
## 5    0.786   3.97     0    7.01    85   264    13   14.8      30.7
## 6   15.3    18.1      0    6.65    93   666    20.2 23.2      13.9

Data cleaning

#Check for missing values
any(is.na(cincinnati))
## [1] FALSE
#Check for duplicates
colSums(is.na(cincinnati))
## CrimeRate     Indus     River   AvgRoom       Age       Tax   PTRatio     LStat 
##         0         0         0         0         0         0         0         0 
##  MedPrice 
##         0
#Check for correct format
str(cincinnati$CrimeRate)
##  num [1:150] 0.0273 0.2536 0.0396 0.1448 0.7857 ...
str(cincinnati$Indus)
##  num [1:150] 7.07 9.9 5.19 10.01 3.97 ...
str(cincinnati$River)
##  num [1:150] 0 0 0 0 0 0 0 0 0 0 ...
str(cincinnati$AvgRoom)
##  num [1:150] 7.18 5.71 6.04 5.73 7.01 ...
str(cincinnati$Age)
##  num [1:150] 61 78 35 65 85 93 32 88 46 97 ...
str(cincinnati$Tax)
##  num [1:150] 242 304 224 432 264 666 300 666 398 403 ...
str(cincinnati$PTRatio)
##  num [1:150] 17.8 18.4 20.2 17.8 13 20.2 15.3 20.2 18.7 14.7 ...
str(cincinnati$LStat)
##  num [1:150] 4.03 11.5 8.01 13.61 14.79 ...
str(cincinnati$MedPrice)
##  num [1:150] 34.7 16.2 21.1 19.3 30.7 13.9 22 14.3 20.8 41.3 ...
#Check for inconsistencies/invalid values
any(cincinnati$CrimeRate < 0)
## [1] FALSE
any(cincinnati$Indus <= 0)
## [1] FALSE
any(cincinnati$River < 0 | cincinnati$River > 1)
## [1] FALSE
any(cincinnati$AvgRoom < 0)
## [1] FALSE
any(cincinnati$Age < 0)
## [1] FALSE
any(cincinnati$Age <= 0)
## [1] FALSE
any(cincinnati$Tax <= 0)
## [1] FALSE
any(cincinnati$PTRatio <= 0)
## [1] FALSE
any(cincinnati$PTRatio <= 0)
## [1] FALSE
any(cincinnati$LStat < 0 | cincinnati$LStat > 100)
## [1] FALSE
any(cincinnati$MedPrice <= 0)
## [1] FALSE
#Look for outliers, determine how to handle them
summary(cincinnati$CrimeRate)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##  0.01096  0.07250  0.19160  3.34698  2.69719 88.97620
boxplot(cincinnati$CrimeRate, main = "Boxplot of Crime Rate")

hist(cincinnati$CrimeRate, breaks = 20, col = "lightcoral", main = "Histogram of Crime Rate")

#Conclusion: There are two high crimerate outliers, but they are still reasonable, and without knowing which city the crime rate is associated with, there's no way to investigate to see if we can justify removing it 

summary(cincinnati$Indus)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.740   4.935   8.560  10.475  18.100  25.650
boxplot(cincinnati$Indus, main = "Boxplot of Porportion of Industry")

hist(cincinnati$Indus, breaks = 20, col = "lightcoral", main = "Histogram of Propotion of Industry")

#Conclusion: No outliers

summary(cincinnati$AvgRoom)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   4.880   5.962   6.321   6.341   6.684   8.704
boxplot(cincinnati$AvgRoom, main = "Boxplot of Average Number of Rooms")

hist(cincinnati$AvgRoom, breaks = 20, col = "lightcoral", main = "Histogram of Average Number of Rooms")

#Conclusion: No outliers

summary(cincinnati$Age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    3.00   41.00   69.50   65.07   91.75  100.00
boxplot(cincinnati$Age, main = "Boxplot of Average Number of Rooms")

hist(cincinnati$Age, breaks = 20, col = "lightcoral", main = "Histogram of Average Number of Rooms")

#Conclusion: No outliers

summary(cincinnati$Tax)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   188.0   284.0   330.0   399.2   437.0   666.0
boxplot(cincinnati$Tax, main = "Boxplot of Tax Rate")

hist(cincinnati$Tax, breaks = 20, col = "lightcoral", main = "Histogram of Tax Rate")

outlier <- subset(cincinnati, Tax == 666)
outlier
## # A tibble: 36 × 9
##    CrimeRate Indus River AvgRoom   Age   Tax PTRatio LStat MedPrice
##        <dbl> <dbl> <dbl>   <dbl> <dbl> <dbl>   <dbl> <dbl>    <dbl>
##  1     15.3   18.1     0    6.65    93   666    20.2  23.2     13.9
##  2      5.58  18.1     0    6.44    88   666    20.2  16.2     14.3
##  3      3.68  18.1     0    5.36    96   666    20.2  10.2     20.8
##  4     16.8   18.1     0    5.28    98   666    20.2  30.8      7.2
##  5      3.85  18.1     1    6.40    91   666    20.2  13.3     21.7
##  6     14.3   18.1     0    6.23    88   666    20.2  13.1     21.4
##  7      5.82  18.1     0    6.51    90   666    20.2  10.3     20.2
##  8      9.72  18.1     0    6.41    97   666    20.2  19.5     17.1
##  9     10.1   18.1     0    6.83    94   666    20.2  19.7     14.1
## 10     19.6   18.1     0    7.31    98   666    20.2  13.4     15  
## # ℹ 26 more rows
#Conclusion: Initially it looks like there is an outlier on 666, but upon further examination, there are 36 records with this value, meaning it's not an outlier. Need to figure out why this is here. 

summary(cincinnati$PTRatio)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   12.60   16.65   18.65   18.28   20.20   21.20
boxplot(cincinnati$PTRatio, main = "Boxplot of Pupil-teacher ratio")

hist(cincinnati$PTRatio, breaks = 20, col = "lightcoral", main = "Histogram of Pupil-teacher ratio")

#Conclusion: One entry is a bit low, but not an outlier

summary(cincinnati$LStat)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.470   6.992  10.475  11.850  15.027  34.410
boxplot(cincinnati$LStat, main = "Boxplot of Lower Status Proportion")

hist(cincinnati$LStat, breaks = 20, col = "lightcoral", main = "Histogram of Lower Status Proportion")

#Conclusion: No outliers

summary(cincinnati$MedPrice)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    5.60   17.73   21.45   22.62   24.77   50.00
boxplot(cincinnati$MedPrice, main = "Boxplot of Median House Price")

hist(cincinnati$MedPrice, breaks = 20, col = "lightcoral", main = "Histogram of Median House Price")

#Conclusion: No outliers

Exploratory Analysis

#Analyze relationship between variables
ggpairs(cincinnati)

ggcorr(cincinnati, label = TRUE, label_round = 2, palette = "RdYlBu")

#Biggest impact on Median Home Price:
#-AvgRooms 0.747 (More rooms = more $)
#-LStat -0.751 (Lower status = less $)
#-PTRatio -0.575 (higher ratio = less teachers/student = less $)
#-Indus -0.574 (more non-retail business = less appealing to live at = less $)
#-Tax -0.540 (higher tax rate = lower $)
#-Age -0.482 (the older the house = less $)

#Possible Variable interdependency
#-Tax/CrimeRate 0.538
#-Tax/Indus 0.698
#-Age/Indus 0.671
#-LStat/Indus 0.645
#-LStat/AvgRoom -0.615
#-Tax/Age  0.508
#-Age/LStat 0.654
#-LStat/Tax 0.565


#Check linear regression assumption of linearity
ggplot(cincinnati, aes(x = CrimeRate, y = MedPrice)) + 
  geom_point() +  
  labs(title = "Crimerate vs Price",
       x = "Crimerate",
       y = "Price") +
  theme_minimal()             

#Result: Negative non-linear

ggplot(cincinnati, aes(x = River, y = MedPrice)) + 
  geom_point() +  
  labs(title = "River vs Price",
       x = "River",
       y = "Price") +
  theme_minimal()             

#Result: Small negative correlation for on river

ggplot(cincinnati, aes(x = AvgRoom, y = MedPrice)) + 
  geom_point() +  
  labs(title = "AvgRoom vs Price",
       x = "AvgRoom",
       y = "Price") +
  theme_minimal()             

#Result: Clear Positive Linear 

ggplot(cincinnati, aes(x = Age, y = MedPrice)) + 
  geom_point() +  
  labs(title = "Age vs Price",
       x = "Age",
       y = "Price") +
  theme_minimal()             

#Result: Negative Non-linear 

ggplot(cincinnati, aes(x = Tax, y = MedPrice)) + 
  geom_point() +  
  labs(title = "Tax vs Price",
       x = "Tax",
       y = "Price") +
  theme_minimal()             

#Result: General Negative Correlation, but a weird grouping at the high end of specific tax rate

ggplot(cincinnati, aes(x = PTRatio, y = MedPrice)) + 
  geom_point() +  
  labs(title = "PTRatio vs Price",
       x = "PTRatio",
       y = "Price") +
  theme_minimal()             

#Result: Loose linear negative

ggplot(cincinnati, aes(x = LStat, y = MedPrice)) + 
  geom_point() +  
  labs(title = "LStat vs Price",
       x = "LStat",
       y = "Price") +
  theme_minimal()             

#Result: Non-linear negative

Research Question Investigation

#How does the crime rate affect housing prices across different price segments?
#create price quartiles
cincinnati$price_quartile <- cut(cincinnati$MedPrice, 
                                 breaks = quantile(cincinnati$MedPrice, probs = 0:4/4), 
                                 include.lowest = TRUE, 
                                 labels = c("Q1", "Q2", "Q3", "Q4"))

# calculate mean crime rate for each quartile
mean_crime_by_quartile <- cincinnati %>%
  group_by(price_quartile) %>%
  summarize(mean_crime_rate = mean(CrimeRate, na.rm = TRUE))

#create scatter plot with trend lines
ggplot(cincinnati, aes(x = MedPrice, y = CrimeRate)) +
  geom_point(aes(color = price_quartile), alpha = 0.6) +
  geom_smooth(method = "lm", se = FALSE, aes(group = price_quartile)) +
  labs(title = "Crime Rate vs Housing Prices by Price Quartile",
       x = "Housing Price",
       y = "Crime Rate") +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'

Interpretation: As we can see here, it appears that crimerate has the highest impact on low price homes. We can see that the lowest price quartile (Q1) seems to have the highest crime rates, and the trend line indicates that there is a strong negative correlation with housing price and crime rate in Q1. Although there are some smal trends in the other quartiles, they aren’t really clear enough to draw meaningful conclusions. From this, it appears that high crime disproportionately impacts lower price housing more than any other pricing group.

# Investigating the RelationShip between Housing Age and Room Size
ggplot(cincinnati, aes(x = Age, y = AvgRoom)) +
  geom_point(color = "blue") +
  labs(title = "Relationship between Housing Age and Average Room Size",
       x = "Housing Age (Years)",
       y = "Average Number of Rooms") +
  theme_minimal()

# Calculate correlation coefficient
age_avgroom_cor <- cor(cincinnati$Age, cincinnati$AvgRoom)
print(paste("The Pearson correlation coefficient between Age and AvgRoom is:", round(age_avgroom_cor, 3)))
## [1] "The Pearson correlation coefficient between Age and AvgRoom is: -0.242"
cor.test(cincinnati$Age, cincinnati$AvgRoom)
## 
##  Pearson's product-moment correlation
## 
## data:  cincinnati$Age and cincinnati$AvgRoom
## t = -3.0386, df = 148, p-value = 0.002811
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.38753602 -0.08537876
## sample estimates:
##        cor 
## -0.2423246

Interpretation: The correlation between the age and AvgRoom is -0.2423 , which means there is a mild negative linear relationship between them and as house get older, they tend to have slightly fewer rooms on average. Upon checking the p-value which is 0.002811 < 0.05 , we can say the observation is significantly significant. Also the confidence interval does not contain zero, it supports that true correlation is different from zero and that there is a statistically significant negative relationship. We can conclude that, in our dataset, as the housing age increases, there is a slight tendency for the average number of rooms to decrease.

#Investigating: Is there a relationship between the student-teacher ratio (PTRatio) and the crime rate (CrimeRate) in the city?
ggplot(cincinnati, aes(x = PTRatio, y = CrimeRate)) +
  geom_point(color = "blue") +
  labs(title = "Relationship between PTRatio and CrimeRate",
       x = "PTRatio",
       y = "CrimeRate") +
  theme_minimal()

#Calculate correlation coefficient
ptratio_crimerate_cor <- cor(cincinnati$PTRatio, cincinnati$CrimeRate)
print(paste("The Pearson correlation coefficient between PTRatio and CrimeRate is:", round(ptratio_crimerate_cor, 3)))
## [1] "The Pearson correlation coefficient between PTRatio and CrimeRate is: 0.275"
cor.test(cincinnati$PTRatio, cincinnati$CrimeRate)
## 
##  Pearson's product-moment correlation
## 
## data:  cincinnati$PTRatio and cincinnati$CrimeRate
## t = 3.4742, df = 148, p-value = 0.0006723
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1196018 0.4165308
## sample estimates:
##       cor 
## 0.2745999

Interpretation: The correlation between PTRatio and Crimerate is 0.274, which suggests that as the ratio of students to teachers increases, so does the crimerate. This intuitively makes sense, because less affuent cities will have less teachers for students, and areas such as this are typically more prone to crime. Upon checking the p-value which is 0.00067 < 0.05 , we can say the observation is significantly significant. Also the confidence interval does not contain zero, it supports that true correlation is different from zero and that there is a statistically significant negative relationship. We can conclude that, in our dataset, as the housing age increases, there is a slight tendency for the average number of rooms to decrease.

#Investigating: How do socioeconomic factors (LStat variable) correlate with property tax rates?
#   LStat - % lower status of the population by city
#   Tax - full-value property-tax rate per $10,000
cincinnati <- read_csv("CINCINNATI.csv") #Reset data
## Rows: 150 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (9): CrimeRate, Indus, River, AvgRoom, Age, Tax, PTRatio, LStat, MedPrice
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
ggplot(cincinnati, aes(x = Tax, y = LStat)) +
  geom_point() +
  labs(
    title = "Tax vs LStat"
  )

# Attempting to explain strange grouping found in scatter plot
#   grouping by river
ggplot(cincinnati, aes(x = Tax, y = LStat, color=as.factor(River))) +
  geom_point() +
  labs(
    title = "Tax vs LStat by River",
    color = "River"
  )

#   grouping by CrimeRate Category
cincinnati$CrimeRate <- cut(
  cincinnati$CrimeRate,
  breaks = c(0, 1, 5, Inf),    # Define breakpoints for "low", "moderate", "high"
  labels = c("low", "moderate", "high")      # Labels for each category
)
ggplot(cincinnati, aes(x = Tax, y = LStat, color=as.factor(CrimeRate))) +
  geom_point() +
  labs(
    title = "Tax vs LStat by Crime rate category",
    color = "Crime rate category"
  )

#   grouping by Indus
cincinnati$Indus <- cut(
  cincinnati$Indus,
  breaks = c(0, 13.60556, Inf), # this cutoff is explained in the next block
  labels = c("low", "high")      
)
ggplot(cincinnati, aes(x = Tax, y = LStat, color=Indus)) +
  geom_point() +
  labs(
    title = "Tax vs LStat by Indus",
    color = "Indus"
  )

Interpretation: There is a positive linear relationship between Tax and LStat, in general as Tax increases LStat also increases. However there seems to be a strange group around a value of ~675 for Tax. Examining the group by coloring by different categorical variables: - River doesn’t seem to have any effect on the grouping, there aren’t enough data points that bound the river (1) to make conclusions. - Crime rate categories of ‘moderate’ and ‘high’ seem to correspond well with the grouping. - Data points with high Indus levels seem to all be part of the strange grouping.

#Investigating: What is the relationship between industrial land use (Indus) and housing characteristics such as room sizes, property tax rates, and housing prices?
cincinnati <- read_csv("CINCINNATI.csv") #Reset data
## Rows: 150 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (9): CrimeRate, Indus, River, AvgRoom, Age, Tax, PTRatio, LStat, MedPrice
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
ggplot(cincinnati, aes(x=Indus))+
  geom_density()+
  labs(
    title = "Indus Density Plot",
    x = "Indus",
    y = "Density"
  )

# === Finding where to split the data ===
# Compute the density
density_values <- density(cincinnati$Indus)

# Find local minima
x <- density_values$x
y <- density_values$y

# Compare each y-value with its neighbors
local_minima <- which(diff(sign(diff(y))) == 2) + 1  # Indices of local minima


# Extract the x and y coordinates of the minima
minima_x <- x[local_minima]
minima_y <- y[local_minima]

# Print the local minima
data.frame(x = minima_x, y = minima_y)
##          x          y
## 1 13.60556 0.01937223
# x          y
# 1 13.60556 0.01937223

ggplot(cincinnati, aes(x = Indus)) +
  geom_density() +
  geom_vline(xintercept = minima_x, color = "red", linetype = "dashed", size = 1) +
  labs(
    title = "Indus Density Plot Split",
    x = "Indus",
    y = "Density"
  )
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

# === Splitting Indus -> low + high ===
cincinnati$Indus <- cut(
  cincinnati$Indus,
  breaks = c(0, minima_x, Inf),
  labels = c("low", "high")      
)

ggpairs(cincinnati[, c("Indus", "AvgRoom", "Tax", "MedPrice")], aes(color=cincinnati$Indus, alpha=0.5))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Interpretation: There seems to be roughly double the amount of low Indus as high Indus, The Indus level doesn’t have much effect of on average room number (AvgRoom) Median price (MedPrice) is shows slight change in distribution between the Indus levels. On average a house in a high Indus area will have a lower MedPrice. Tax rate (Tax) however, varies wildly between low and high Indus levels. Tax rates in a high Indus area will on average, be much higher than those in a low area.

Data Transformations/Categorizations

cincinnati <- read_csv("CINCINNATI.csv") #Reset data
## Rows: 150 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (9): CrimeRate, Indus, River, AvgRoom, Age, Tax, PTRatio, LStat, MedPrice
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Investigate bimodality in Indus, convert to categorical variable

#There appears to be bimodality with Indus
dens <- density(cincinnati$Indus)
plot(dens, main = "Checking for Bimodality in Indus")
polygon(dens, col = "skyblue", border = "darkblue")
#Find local minima
local_mins <- which(diff(sign(diff(dens$y))) == 2) + 1
min_x_values <- dens$x[local_mins]
min_y_values <- dens$y[local_mins]
#Add vertical lines and points at the local minima
abline(v = min_x_values, col = "red", lty = 2)
points(min_x_values, min_y_values, col = "red", pch = 19)

#Print the local minimum x-values
min_x_values
## [1] 13.60556
#Convert Indus to a categorical variable.
cincinnati$IndusCat <- ifelse(cincinnati$Indus < min_x_values, "Low", "High")
cincinnati$IndusCat <- factor(cincinnati$IndusCat, levels = c("Low", "High"))
ggpairs(data = cincinnati, aes(color = IndusCat, alpha = 0.6))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Investigate weird spread of CrimeRate, convert to categorical variable

ggplot(cincinnati, aes(x=CrimeRate))+
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

summary(cincinnati$CrimeRate)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##  0.01096  0.07250  0.19160  3.34698  2.69719 88.97620
 #    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
 # 0.01096  0.07250  0.19160  3.34698  2.69719 88.97620 

#recoding crimerate
# breakpoints: (0, Q1, Q3, inf) -> low, medium, high

#cincinnati$CrimeRate <- cut(
#  cincinnati$CrimeRate,
#  breaks = c(0, 0.07250, 2.69719, Inf),    # Define breakpoints for "low", "medium", "high"
#  labels = c("low", "medium", "high")      # Labels for each category
#)
#ggpairs(cincinnati, aes(color=cincinnati$CrimeRate, alpha=0.5))


# ==== alt crimerate breakpoints ====

#cincinnati <- read.csv("CINCINNATI.csv")
#cincinnati$River <- recode(cincinnati$River, `0`="otherwise", `1`="bounds river")

# using (0, Q1, Q3, inf) as breakpoints for low, medium, high didn't seem to represent it well
# using arbitrary (0, 1, 5, inf) seems like better breakpoints

cincinnati$CrimeRateCat <- cut(
  cincinnati$CrimeRate,
  breaks = c(0, 1, 5, Inf),    # Define breakpoints for "low", "medium", "high"
  labels = c("Low", "Medium", "High")      # Labels for each category
)
cincinnati$CrimeRateCat <- factor(cincinnati$CrimeRateCat, levels = c("Low", "Medium", "High"))
ggpairs(cincinnati, aes(color=cincinnati$CrimeRateCat, alpha=0.5))
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

#create binary variables for categorical variable
cincinnati$IsLowCrime <- ifelse(cincinnati$CrimeRateCat == "Low", 1, 0)
cincinnati$IsMediumCrime <- ifelse(cincinnati$CrimeRateCat == "Medium", 1, 0)
cincinnati$IsHighCrime <- ifelse(cincinnati$CrimeRateCat == "High", 1, 0)

Individual Variable Residual Analysis

#CrimeRate
m <- lm(MedPrice ~ CrimeRateCat, data = cincinnati)
par(mfrow = c(2,2))
plot(m)

#examine each category vs MedPrice
ggplot(cincinnati, aes(x = CrimeRate, y = MedPrice, color = CrimeRateCat)) +
  geom_point(alpha = 0.7) +
  theme_minimal() +
  labs(title = "Crime vs Median Housing Price",
       x = "Crime Rate",
       y = "Median Housing Price",
       color = "Crime Category")

Findings: Looks fine, nothing of note here. Confirms that higher crime rates generally correlate to lower housing prices

#IndusCat
m <- lm(MedPrice ~ IndusCat, data = cincinnati)
par(mfrow = c(2,2))
plot(m)

#examine each category vs MedPrice
ggplot(cincinnati, aes(x = Indus, y = MedPrice, color = IndusCat)) +
  geom_point(alpha = 0.7) +
  theme_minimal() +
  labs(title = "Indus vs Median Housing Price",
       x = "Indus",
       y = "Median Housing Price",
       color = "Indus Category")

Findings: Looks fine, nothing of note. Confirms that higher Indus categories generally have a moderately lower housing price

#River
m <- lm(MedPrice ~ River, data = cincinnati)
par(mfrow = c(2,2))
plot(m)

Findings: Looks mostly ok, the normality looks slightly skewed but not too alarming.

#AvgRoom
m <- lm(MedPrice ~ AvgRoom, data = cincinnati)
par(mfrow = c(2,2))
plot(m)

Findings: Overall, it looks ok. The Residuals vs Fitted plot has a moderate funnel shape, indicating possible slight levels of non-constant varience. The trend line on the Residuals vs Fitted plot isn’t too bad, so leave as is for now. Normality looks good.

#Age
m <- lm(MedPrice ~ Age, data = cincinnati)
par(mfrow = c(2,2))
plot(m)

#What does log transform do?
m_log <- lm(log(MedPrice) ~ Age, data = cincinnati)
par(mfrow = c(2,2))
plot(m_log)

Findings: Residual plot looks a bit concerning. Close grouping below residual line, spread grouping above it. Q-Q plot right side trends upward rather quickly. Doing a log transform on MedPrice seems to fix normality and tighten up the varience, though there is still a reverse funnel shape, indicating more variance at lower prices. This makes sense, since lower-priced homes are more susceptible to factors like crime, age, etc. which aren’t in the model here.

#Tax
m <- lm(MedPrice ~ Tax, data = cincinnati)
par(mfrow = c(2,2))
plot(m)

Findings: We have a wierd grouping of tax rates on the left side, which corresponds to the high indus tax bracket, and then a normal spread on the right, corresponding to the low indus cities. On both sides, the residuals look evenly spread around the line, indicating equal varience despite the weird groupings. The normality also looks good.

#PTRatio
m <- lm(MedPrice ~ PTRatio, data = cincinnati)
par(mfrow = c(2,2))
plot(m)

Findings: Looks ok, nothing too noteworthy here.

#LStat
m <- lm(MedPrice ~ LStat, data = cincinnati)
par(mfrow = c(2,2))
plot(m)

m1 <- lm(MedPrice ~ I(LStat^2), data = cincinnati)
par(mfrow = c(2,2))
plot(m1)

m2 <- lm(MedPrice ~ I(1/LStat), data = cincinnati)
par(mfrow = c(2,2))
plot(m2)

m3 <- lm(MedPrice ~ log(LStat), data = cincinnati)
par(mfrow = c(2,2))
plot(m3)

Findings: Initial plot of LStat indicates non-linearity in the residual plot, and also in the Q-Q plot via a sharp deviation on the right end of the line. To test the non-linearity theory, I applied two different transformations on LStat: 1. Square LStat: Standard way to try to linearize a relationship. This did nothing to help the residual plot, and seems to have only sharpened the trend. 2. 1/LStat: This was indicated by the shape of the LStat vs MedPrice plot earlier, where there seemed to be a 1/x relationship.

I also decided to try a log transform on LStat, which actually seemed to significantly improve the residual plot. The Q-Q plots of 1/LStat and log(LStat) seem almost the same, but the residual plot is the best for log(Lstat), indicating that doing the log transformations seems to significantly improve the variance.

Figure out how to handle Indus/Tax relationship

ggplot(cincinnati, aes(x = Tax, y = MedPrice, color = IndusCat)) +
  geom_point(alpha = 0.7) +
  theme_minimal() +
  labs(title = "Tax vs MedPrice w/IndusCat",
       x = "Tax",
       y = "MedPrice",
       color = "Indus Category")

ggplot(cincinnati, aes(x = Tax, y = MedPrice, color = IndusCat)) +
  geom_point(alpha = 0.7) +
  geom_smooth(method = "lm", se = FALSE) +
  geom_smooth(aes(group = 1), method = "lm", se = FALSE, color = "black", linetype = "dashed") + # Overall trendline
  theme_minimal() +
  labs(title = "Tax vs MedPrice with Trendlines (by IndusCat and Overall)",
       x = "Tax",
       y = "MedPrice",
       color = "Indus Category")
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'

Findings: The special group (High Indus, Tax = 666) isn’t drastically deviating in terms of the Tax relationship. The blue trendline, which includes the cluster at Tax ≈ 666, doesn’t show a sharp bend or a completely different slope at that high tax value compared to the rest of the blue line. This reinforces the observation that they might roughly follow a similar trend with respect to tax. For now, I’ll leave as is.

Split Data into Training/Test sets

set.seed(123)
sample <- sample(c(TRUE, FALSE), nrow(cincinnati), replace=TRUE, prob=c(0.7,0.3))
train  <- cincinnati[sample, ]
test   <- cincinnati[!sample, ]

Linear Regression - Full Model

full_model1 <- lm(MedPrice ~ CrimeRateCat + IndusCat + River + AvgRoom + Age + Tax + PTRatio + log(LStat), data = train)
summary(full_model1)
## 
## Call:
## lm(formula = MedPrice ~ CrimeRateCat + IndusCat + River + AvgRoom + 
##     Age + Tax + PTRatio + log(LStat), data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.5632  -2.4985  -0.2042   2.2954  12.4737 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        22.057207   9.870928   2.235   0.0278 *  
## CrimeRateCatMedium -0.490999   1.841160  -0.267   0.7903    
## CrimeRateCatHigh   -4.650214   2.404492  -1.934   0.0561 .  
## IndusCatHigh        0.791996   1.744445   0.454   0.6508    
## River              -1.096145   1.947074  -0.563   0.5748    
## AvgRoom             4.205653   1.026139   4.099  8.7e-05 ***
## Age                 0.015824   0.022194   0.713   0.4776    
## Tax                -0.003583   0.006213  -0.577   0.5655    
## PTRatio            -0.598354   0.238513  -2.509   0.0138 *  
## log(LStat)         -6.148356   1.474797  -4.169  6.7e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.939 on 96 degrees of freedom
## Multiple R-squared:  0.7845, Adjusted R-squared:  0.7643 
## F-statistic: 38.83 on 9 and 96 DF,  p-value: < 2.2e-16
par(mfrow = c(2,2))
plot(full_model1)

ad.test(full_model1$residuals)
## 
##  Anderson-Darling normality test
## 
## data:  full_model1$residuals
## A = 0.36117, p-value = 0.4397
full_model2 <- lm(MedPrice ~ CrimeRateCat + IndusCat + River + AvgRoom + Age + Tax + PTRatio + I(1/LStat), data = train)
summary(full_model2)
## 
## Call:
## lm(formula = MedPrice ~ CrimeRateCat + IndusCat + River + AvgRoom + 
##     Age + Tax + PTRatio + I(1/LStat), data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.4032 -2.5199  0.0155  2.1413 13.2817 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         2.805374   7.114944   0.394  0.69424    
## CrimeRateCatMedium -0.705842   1.787028  -0.395  0.69373    
## CrimeRateCatHigh   -6.041264   2.315745  -2.609  0.01054 *  
## IndusCatHigh        0.757633   1.691698   0.448  0.65527    
## River              -1.725336   1.893379  -0.911  0.36445    
## AvgRoom             4.325973   0.910087   4.753 7.02e-06 ***
## Age                 0.008490   0.019941   0.426  0.67124    
## Tax                -0.002642   0.006038  -0.438  0.66271    
## PTRatio            -0.633996   0.231168  -2.743  0.00727 ** 
## I(1/LStat)         45.385868   9.159374   4.955 3.10e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.82 on 96 degrees of freedom
## Multiple R-squared:  0.7973, Adjusted R-squared:  0.7783 
## F-statistic: 41.96 on 9 and 96 DF,  p-value: < 2.2e-16
par(mfrow = c(2,2))
plot(full_model2)

ad.test(full_model2$residuals)
## 
##  Anderson-Darling normality test
## 
## data:  full_model2$residuals
## A = 0.2587, p-value = 0.7095

Findings: considering both the slightly better summary statistics and the potentially improved linearity suggested by the residuals plot, it appears that using I(1/LStat) is likely the better transformation for LStat in the linear regression model for predicting MedPrice in this dataset. Addtionally, the AD p-values for both models indicate that we don’t have strong evidence against normality.

Linear Regression - Best Subset Model Selection

best_subset_model <- regsubsets(MedPrice ~ CrimeRateCat + IndusCat + River + AvgRoom + Age + Tax + PTRatio + I(1/LStat), data = train, nbest = 2)
best_subset_summary <- summary(best_subset_model)

#Add RSS, Adjusted R², Cp, and BIC, MSE, etc to the same data frame
mse_values <- best_subset_summary$rss / (nrow(train) - 1:nrow(best_subset_summary$which))
mse_values <- round(mse_values, 2)

results <- data.frame(
  modelId = 1:nrow(best_subset_summary$which),
  best_subset_summary$outmat,
  Rsq = best_subset_summary$rsq,
  RSS = best_subset_summary$rss,
  AdjR2 = best_subset_summary$adjr2,
  Cp = best_subset_summary$cp,
  BIC = best_subset_summary$bic,
  MSE = mse_values
)
results$Rsq <- round(results$Rsq, 2)
results$RSS <- round(results$RSS, 2)
results$AdjR2 <- round(results$AdjR2, 2)
results$Cp <- round(results$Cp, 2)
results$BIC <- round(results$BIC, 2)


print(results)
##          modelId CrimeRateCatMedium CrimeRateCatHigh IndusCatHigh River AvgRoom
## 1  ( 1 )       1                                                               
## 1  ( 2 )       2                                                              *
## 2  ( 1 )       3                                   *                           
## 2  ( 2 )       4                                                               
## 3  ( 1 )       5                                   *                          *
## 3  ( 2 )       6                                                              *
## 4  ( 1 )       7                                   *                          *
## 4  ( 2 )       8                                   *                          *
## 5  ( 1 )       9                                   *                  *       *
## 5  ( 2 )      10                                   *                          *
## 6  ( 1 )      11                                   *                  *       *
## 6  ( 2 )      12                                   *                  *       *
## 7  ( 1 )      13                                   *                  *       *
## 7  ( 2 )      14                                   *            *     *       *
## 8  ( 1 )      15                                   *            *     *       *
## 8  ( 2 )      16                  *                *            *     *       *
##          Age Tax PTRatio I.1.LStat.  Rsq     RSS AdjR2     Cp     BIC   MSE
## 1  ( 1 )                          * 0.65 2430.32  0.65  64.52 -101.48 23.15
## 1  ( 2 )                            0.47 3643.46  0.47 147.65  -58.56 35.03
## 2  ( 1 )                          * 0.70 2048.51  0.70  40.36 -114.94 19.89
## 2  ( 2 )               *          * 0.70 2049.82  0.70  40.45 -114.87 20.10
## 3  ( 1 )                          * 0.78 1544.77  0.77   7.85 -140.19 15.29
## 3  ( 2 )       *                  * 0.76 1676.03  0.75  16.84 -131.55 16.76
## 4  ( 1 )               *          * 0.79 1424.74  0.79   1.62 -144.10 14.39
## 4  ( 2 )       *                  * 0.78 1528.43  0.77   8.73 -136.66 15.60
## 5  ( 1 )               *          * 0.80 1411.26  0.79   2.70 -140.45 14.55
## 5  ( 2 )       *       *          * 0.79 1419.18  0.78   3.24 -139.85 14.78
## 6  ( 1 )       *       *          * 0.80 1408.20  0.78   4.49 -136.01 14.82
## 6  ( 2 )   *           *          * 0.80 1408.31  0.78   4.50 -136.00 14.98
## 7  ( 1 )   *   *       *          * 0.80 1404.73  0.78   6.25 -131.61 15.10
## 7  ( 2 )       *       *          * 0.80 1405.12  0.78   6.28 -131.58 15.27
## 8  ( 1 )   *   *       *          * 0.80 1403.35  0.78   8.16 -127.05 15.42
## 8  ( 2 )       *       *          * 0.80 1403.72  0.78   8.18 -127.02 15.60

Make four plots to allow comparison between models

p1<-ggplot(data = results, aes(x = as.factor(modelId), y = MSE))+
  geom_point(color="black", size=2) +ylab("MSE") + xlab("ModelId")

p2<-ggplot(data = results, aes(x = as.factor(modelId), y = AdjR2)) +
  geom_point(color="black", size=2) +ylab("Adjusted RSQ") + xlab("ModelId")+
  geom_point(data = results[which.max(results$AdjR2), ], color="red", 
             size=3) 

p3<-ggplot(data = results, aes(x = as.factor(modelId), y = Cp)) +
  geom_point(color="black", size=2) +ylab("Cp") + xlab("ModelId")+
  geom_point(data = results[which.min(results$Cp), ], color="red", 
             size=3) 

p4<-ggplot(data = results, aes(x = as.factor(modelId), y = BIC)) +
  geom_point(color="black", size=2) +ylab("BIC") + xlab("ModelId")+
  geom_point(data = results[which.min(results$BIC), ], color="red", 
             size=3) 

grid.arrange(p1,p2,p3,p4,nrow=2)

Findings: It looks like model 7 is the clear winner. It is closely tied for lowest MSE, has the best Adjust RSQ, and lowest Cp and Bix values.

Get Model 7

coef(best_subset_model, 7)
##      (Intercept) CrimeRateCatHigh          AvgRoom          PTRatio 
##        1.0133460       -5.8807897        4.4816233       -0.6078183 
##       I(1/LStat) 
##       44.3289871
model7 <- lm(MedPrice ~ CrimeRateCat + AvgRoom + PTRatio + I(1/LStat), data=train)
summary(model7)
## 
## Call:
## lm(formula = MedPrice ~ CrimeRateCat + AvgRoom + PTRatio + I(1/LStat), 
##     data = train)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -9.081 -2.265  0.068  2.099 13.657 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          1.2525     6.8234   0.184  0.85473    
## CrimeRateCatMedium  -0.6695     1.1804  -0.567  0.57186    
## CrimeRateCatHigh    -6.0447     1.1389  -5.308 6.70e-07 ***
## AvgRoom              4.4965     0.8471   5.308 6.69e-07 ***
## PTRatio             -0.6133     0.2093  -2.930  0.00419 ** 
## I(1/LStat)          43.2888     7.9558   5.441 3.77e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.769 on 100 degrees of freedom
## Multiple R-squared:  0.7946, Adjusted R-squared:  0.7843 
## F-statistic: 77.36 on 5 and 100 DF,  p-value: < 2.2e-16
par(mfrow = c(2,2))
plot(full_model1)

ad.test(full_model1$residuals)
## 
##  Anderson-Darling normality test
## 
## data:  full_model1$residuals
## A = 0.36117, p-value = 0.4397
anova(model7)
## Analysis of Variance Table
## 
## Response: MedPrice
##               Df  Sum Sq Mean Sq F value    Pr(>F)    
## CrimeRateCat   2 2233.48 1116.74  78.634 < 2.2e-16 ***
## AvgRoom        1 2624.84 2624.84 184.826 < 2.2e-16 ***
## PTRatio        1  214.14  214.14  15.078 0.0001853 ***
## I(1/LStat)     1  420.46  420.46  29.606 3.774e-07 ***
## Residuals    100 1420.17   14.20                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
deviance(model7)
## [1] 1420.171
vif(model7)
##                  GVIF Df GVIF^(1/(2*Df))
## CrimeRateCat 1.501108  2        1.106886
## AvgRoom      1.974105  1        1.405029
## PTRatio      1.423441  1        1.193081
## I(1/LStat)   2.430542  1        1.559020

Findings: The linear model fitted to the training data, predicting MedPrice using CrimeRateCat, AvgRoom, PTRatio, and I(1/LStat), demonstrates a strong overall fit, with a high Multiple R-squared of 0.7946 and a significant F-statistic (p < 2.2e-16). The ANOVA results indicate that all four predictors, CrimeRateCat, AvgRoom, PTRatio, and I(1/LStat), are statistically significant contributors to explaining variation in median housing prices.

Specifically, a higher average number of rooms and a lower percentage of lower-status population (reflected by a higher value of 1/LStat) are associated with higher median prices. A higher pupil-teacher ratio (PTRatio) is associated with lower median prices. The crime rate category also has a significant impact, with areas classified as “High” showing substantially lower median prices compared to the “Low” reference group.

The residual sum of squares for the model is 1420.17, and an Anderson-Darling test on the residuals suggests no strong evidence against normality (p = 0.4397). Finally, variance inflation factors (VIFs) are all below 2, indicating that multicollinearity is not a serious concern in the model

Linear Regression - Forward Selection

null_model <- lm(MedPrice ~ 1, data = train)
upper_model <- formula(full_model2)
forward_model <- step(null_model, direction = "forward", scope = upper_model)
## Start:  AIC=444.84
## MedPrice ~ 1
## 
##                Df Sum of Sq    RSS    AIC
## + I(1/LStat)    1    4482.8 2430.3 336.03
## + AvgRoom       1    3269.6 3643.5 378.95
## + PTRatio       1    2358.7 4554.4 402.60
## + Tax           1    2326.8 4586.3 403.34
## + CrimeRateCat  2    2233.5 4679.6 407.48
## + IndusCat      1    1668.4 5244.7 417.56
## + Age           1    1650.9 5262.1 417.91
## <none>                      6913.1 444.84
## + River         1       1.9 6911.2 446.81
## 
## Step:  AIC=336.03
## MedPrice ~ I(1/LStat)
## 
##                Df Sum of Sq    RSS    AIC
## + PTRatio       1    380.50 2049.8 319.98
## + CrimeRateCat  2    382.33 2048.0 321.88
## + Tax           1    315.91 2114.4 323.27
## + AvgRoom       1    289.68 2140.6 324.57
## + IndusCat      1    112.04 2318.3 333.02
## <none>                      2430.3 336.03
## + Age           1      5.48 2424.8 337.79
## + River         1      0.00 2430.3 338.03
## 
## Step:  AIC=319.98
## MedPrice ~ I(1/LStat) + PTRatio
## 
##                Df Sum of Sq    RSS    AIC
## + AvgRoom       1   220.334 1829.5 309.93
## + CrimeRateCat  2   229.500 1820.3 311.39
## + Tax           1   159.924 1889.9 313.37
## + IndusCat      1    78.503 1971.3 317.84
## <none>                      2049.8 319.98
## + River         1    11.286 2038.5 321.39
## + Age           1     2.361 2047.5 321.86
## 
## Step:  AIC=309.93
## MedPrice ~ I(1/LStat) + PTRatio + AvgRoom
## 
##                Df Sum of Sq    RSS    AIC
## + CrimeRateCat  2    409.31 1420.2 287.08
## + Tax           1    280.97 1548.5 294.25
## + IndusCat      1    144.83 1684.7 303.18
## <none>                      1829.5 309.93
## + Age           1     16.27 1813.2 310.98
## + River         1      5.18 1824.3 311.62
## 
## Step:  AIC=287.08
## MedPrice ~ I(1/LStat) + PTRatio + AvgRoom + CrimeRateCat
## 
##            Df Sum of Sq    RSS    AIC
## <none>                  1420.2 287.08
## + River     1   10.0761 1410.1 288.32
## + Age       1    3.0894 1417.1 288.85
## + Tax       1    1.9217 1418.2 288.94
## + IndusCat  1    1.6338 1418.5 288.96
coef(forward_model)
##        (Intercept)         I(1/LStat)            PTRatio            AvgRoom 
##          1.2525372         43.2887552         -0.6133069          4.4965043 
## CrimeRateCatMedium   CrimeRateCatHigh 
##         -0.6695093         -6.0447106
summary(forward_model)
## 
## Call:
## lm(formula = MedPrice ~ I(1/LStat) + PTRatio + AvgRoom + CrimeRateCat, 
##     data = train)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -9.081 -2.265  0.068  2.099 13.657 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          1.2525     6.8234   0.184  0.85473    
## I(1/LStat)          43.2888     7.9558   5.441 3.77e-07 ***
## PTRatio             -0.6133     0.2093  -2.930  0.00419 ** 
## AvgRoom              4.4965     0.8471   5.308 6.69e-07 ***
## CrimeRateCatMedium  -0.6695     1.1804  -0.567  0.57186    
## CrimeRateCatHigh    -6.0447     1.1389  -5.308 6.70e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.769 on 100 degrees of freedom
## Multiple R-squared:  0.7946, Adjusted R-squared:  0.7843 
## F-statistic: 77.36 on 5 and 100 DF,  p-value: < 2.2e-16
par(mfrow = c(2,2))
plot(forward_model)

ad.test(forward_model$residuals)
## 
##  Anderson-Darling normality test
## 
## data:  forward_model$residuals
## A = 0.29576, p-value = 0.5886
anova(forward_model)
## Analysis of Variance Table
## 
## Response: MedPrice
##               Df Sum Sq Mean Sq F value    Pr(>F)    
## I(1/LStat)     1 4482.8  4482.8 315.650 < 2.2e-16 ***
## PTRatio        1  380.5   380.5  26.793 1.172e-06 ***
## AvgRoom        1  220.3   220.3  15.515 0.0001516 ***
## CrimeRateCat   2  409.3   204.7  14.411 3.167e-06 ***
## Residuals    100 1420.2    14.2                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
deviance(forward_model)
## [1] 1420.171
vif(forward_model)
##                  GVIF Df GVIF^(1/(2*Df))
## I(1/LStat)   2.430542  1        1.559020
## PTRatio      1.423441  1        1.193081
## AvgRoom      1.974105  1        1.405029
## CrimeRateCat 1.501108  2        1.106886

Findings: Same model as was produced by best subset selection (model7)

Linear Regression - Backward Selection

backward_model <- step(full_model2, direction = "backward")
## Start:  AIC=293.64
## MedPrice ~ CrimeRateCat + IndusCat + River + AvgRoom + Age + 
##     Tax + PTRatio + I(1/LStat)
## 
##                Df Sum of Sq    RSS    AIC
## - Age           1      2.65 1403.7 291.85
## - Tax           1      2.79 1403.9 291.86
## - IndusCat      1      2.93 1404.0 291.87
## - River         1     12.12 1413.2 292.56
## <none>                      1401.1 293.64
## - PTRatio       1    109.78 1510.8 299.64
## - CrimeRateCat  2    144.00 1545.1 300.01
## - AvgRoom       1    329.76 1730.8 314.05
## - I(1/LStat)    1    358.34 1759.4 315.79
## 
## Step:  AIC=291.84
## MedPrice ~ CrimeRateCat + IndusCat + River + AvgRoom + Tax + 
##     PTRatio + I(1/LStat)
## 
##                Df Sum of Sq    RSS    AIC
## - Tax           1      4.22 1407.9 290.16
## - IndusCat      1      4.46 1408.2 290.18
## - River         1     10.89 1414.6 290.66
## <none>                      1403.7 291.84
## - PTRatio       1    109.12 1512.8 297.78
## - CrimeRateCat  2    141.36 1545.1 298.02
## - AvgRoom       1    390.26 1794.0 315.85
## - I(1/LStat)    1    417.23 1821.0 317.43
## 
## Step:  AIC=290.16
## MedPrice ~ CrimeRateCat + IndusCat + River + AvgRoom + PTRatio + 
##     I(1/LStat)
## 
##                Df Sum of Sq    RSS    AIC
## - IndusCat      1      2.16 1410.1 288.32
## - River         1     10.60 1418.5 288.96
## <none>                      1407.9 290.16
## - PTRatio       1    127.90 1535.8 297.38
## - CrimeRateCat  2    276.62 1684.6 305.18
## - AvgRoom       1    388.90 1796.8 314.02
## - I(1/LStat)    1    421.75 1829.7 315.94
## 
## Step:  AIC=288.33
## MedPrice ~ CrimeRateCat + River + AvgRoom + PTRatio + I(1/LStat)
## 
##                Df Sum of Sq    RSS    AIC
## - River         1     10.08 1420.2 287.08
## <none>                      1410.1 288.32
## - PTRatio       1    129.70 1539.8 295.65
## - CrimeRateCat  2    414.20 1824.3 311.62
## - AvgRoom       1    390.57 1800.7 312.24
## - I(1/LStat)    1    421.76 1831.8 314.06
## 
## Step:  AIC=287.08
## MedPrice ~ CrimeRateCat + AvgRoom + PTRatio + I(1/LStat)
## 
##                Df Sum of Sq    RSS    AIC
## <none>                      1420.2 287.08
## - PTRatio       1    121.95 1542.1 293.81
## - CrimeRateCat  2    409.31 1829.5 309.93
## - AvgRoom       1    400.15 1820.3 311.39
## - I(1/LStat)    1    420.46 1840.6 312.57
coef(backward_model)
##        (Intercept) CrimeRateCatMedium   CrimeRateCatHigh            AvgRoom 
##          1.2525372         -0.6695093         -6.0447106          4.4965043 
##            PTRatio         I(1/LStat) 
##         -0.6133069         43.2887552
summary(backward_model)
## 
## Call:
## lm(formula = MedPrice ~ CrimeRateCat + AvgRoom + PTRatio + I(1/LStat), 
##     data = train)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -9.081 -2.265  0.068  2.099 13.657 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          1.2525     6.8234   0.184  0.85473    
## CrimeRateCatMedium  -0.6695     1.1804  -0.567  0.57186    
## CrimeRateCatHigh    -6.0447     1.1389  -5.308 6.70e-07 ***
## AvgRoom              4.4965     0.8471   5.308 6.69e-07 ***
## PTRatio             -0.6133     0.2093  -2.930  0.00419 ** 
## I(1/LStat)          43.2888     7.9558   5.441 3.77e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.769 on 100 degrees of freedom
## Multiple R-squared:  0.7946, Adjusted R-squared:  0.7843 
## F-statistic: 77.36 on 5 and 100 DF,  p-value: < 2.2e-16
par(mfrow = c(2,2))
plot(backward_model)

ad.test(backward_model$residuals)
## 
##  Anderson-Darling normality test
## 
## data:  backward_model$residuals
## A = 0.29576, p-value = 0.5886
anova(backward_model)
## Analysis of Variance Table
## 
## Response: MedPrice
##               Df  Sum Sq Mean Sq F value    Pr(>F)    
## CrimeRateCat   2 2233.48 1116.74  78.634 < 2.2e-16 ***
## AvgRoom        1 2624.84 2624.84 184.826 < 2.2e-16 ***
## PTRatio        1  214.14  214.14  15.078 0.0001853 ***
## I(1/LStat)     1  420.46  420.46  29.606 3.774e-07 ***
## Residuals    100 1420.17   14.20                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
deviance(backward_model)
## [1] 1420.171
vif(backward_model)
##                  GVIF Df GVIF^(1/(2*Df))
## CrimeRateCat 1.501108  2        1.106886
## AvgRoom      1.974105  1        1.405029
## PTRatio      1.423441  1        1.193081
## I(1/LStat)   2.430542  1        1.559020

Findings: Also produced the exact same model as the forward selection and model7

Model Comparison & Analysis

#Comparing model7 with another model with tax rate.
# We took model 8 here which takes Crime Rate, Avg room. ,  Tax and  1/LStat which has Adj. Rsq:0.77, Cp:   8.73, BIC:-136.66   and MSE of 15.60
coef(best_subset_model, 8)
##      (Intercept) CrimeRateCatHigh          AvgRoom              Tax 
##     -11.61317438      -5.41653946       4.95850474      -0.00472546 
##       I(1/LStat) 
##      45.94558786
model8 <- lm(MedPrice ~ CrimeRateCat + AvgRoom + Tax + I(1/LStat), data=train)
summary(model8)
## 
## Call:
## lm(formula = MedPrice ~ CrimeRateCat + AvgRoom + Tax + I(1/LStat), 
##     data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.9249  -2.5300   0.1701   2.0159  13.3584 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        -11.436996   5.224027  -2.189   0.0309 *  
## CrimeRateCatMedium   0.280562   1.471531   0.191   0.8492    
## CrimeRateCatHigh    -5.171072   2.166700  -2.387   0.0189 *  
## AvgRoom              4.951500   0.862965   5.738 1.03e-07 ***
## Tax                 -0.005308   0.005497  -0.966   0.3366    
## I(1/LStat)          46.125787   8.206263   5.621 1.72e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.909 on 100 degrees of freedom
## Multiple R-squared:  0.779,  Adjusted R-squared:  0.7679 
## F-statistic: 70.49 on 5 and 100 DF,  p-value: < 2.2e-16
par(mfrow = c(2,2))
plot(full_model1)

ad.test(full_model1$residuals)
## 
##  Anderson-Darling normality test
## 
## data:  full_model1$residuals
## A = 0.36117, p-value = 0.4397
anova(model8)
## Analysis of Variance Table
## 
## Response: MedPrice
##               Df  Sum Sq Mean Sq  F value    Pr(>F)    
## CrimeRateCat   2 2233.48 1116.74  73.0910 < 2.2e-16 ***
## AvgRoom        1 2624.84 2624.84 171.7969 < 2.2e-16 ***
## Tax            1   44.18   44.18   2.8916   0.09215 .  
## I(1/LStat)     1  482.71  482.71  31.5934 1.723e-07 ***
## Residuals    100 1527.88   15.28                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
deviance(model8)
## [1] 1527.876
vif(model8)
##                  GVIF Df GVIF^(1/(2*Df))
## CrimeRateCat 5.044310  2        1.498651
## AvgRoom      1.904309  1        1.379967
## Tax          5.038647  1        2.244693
## I(1/LStat)   2.403677  1        1.550380

Findings: The linear regression model predicting MedPrice using CrimeRateCat, AvgRoom, Tax, and 1/LStat demonstrates a strong fit to the training data, with a Multiple R-squared of 0.779 and an Adjusted R-squared of 0.7679. The model is statistically significant overall, as indicated by a high F-statistic (70.49, p < 2.2e-16). Among the predictors, AvgRoom and 1/LStat are highly significant (p < 0.001), suggesting that homes with more rooms and a lower proportion of lower-status residents (captured by a higher value of 1/LStat) are associated with higher median housing prices.

Additionally, CrimeRateCatHigh is statistically significant (p = 0.0189), indicating that homes in high crime areas tend to have notably lower median prices compared to low crime areas. However, the Tax variable and CrimeRateCatMedium are not statistically significant (p > 0.3 and p > 0.8 respectively), suggesting weaker explanatory power in the current model context.

The residual standard error is 3.909 on 100 degrees of freedom, and the residuals appear symmetrically distributed, as visualized in the diagnostic plots. An Anderson-Darling test on the residuals does not indicate a violation of normality assumptions. Variance Inflation Factors (VIFs) are all within acceptable ranges, indicating no significant multicollinearity concerns. Overall, while this model is slightly weaker than Model 7, it still captures key predictors of housing prices with relatively strong explanatory power.

#Calculate MSPE with test data, compare

# Predict MedPrice using model 7 and model 8 on test data
pred_model7 <- predict(model7, newdata = test)
pred_model8 <- predict(model8, newdata = test)
# Mean Squared Prediction Error
mspe_model7 <- mean((test$MedPrice - pred_model7)^2)
mspe_model8 <- mean((test$MedPrice - pred_model8)^2)

# Print the results
mspe_model7
## [1] 15.2939
mspe_model8
## [1] 15.67446
#

Experiment building models With Removing Tax Rate outlier (666)

#Drop outliers from train and test

cincinnati <- read_csv("CINCINNATI.csv")
## Rows: 150 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (9): CrimeRate, Indus, River, AvgRoom, Age, Tax, PTRatio, LStat, MedPrice
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
cincinnati_clean <- cincinnati %>% 
  filter(Tax <= 600)
head(cincinnati_clean)
## # A tibble: 6 × 9
##   CrimeRate Indus River AvgRoom   Age   Tax PTRatio LStat MedPrice
##       <dbl> <dbl> <dbl>   <dbl> <dbl> <dbl>   <dbl> <dbl>    <dbl>
## 1    0.0273  7.07     0    7.18    61   242    17.8  4.03     34.7
## 2    0.254   9.9      0    5.70    78   304    18.4 11.5      16.2
## 3    0.0396  5.19     0    6.04    35   224    20.2  8.01     21.1
## 4    0.145  10.0      0    5.73    65   432    17.8 13.6      19.3
## 5    0.786   3.97     0    7.01    85   264    13   14.8      30.7
## 6    0.0110  2.25     0    6.45    32   300    15.3  8.23     22
#IndusCat

#There appears to be bimodality with Indus
dens <- density(cincinnati_clean$Indus)
plot(dens, main = "Checking for Bimodality in Indus")
polygon(dens, col = "skyblue", border = "darkblue")
#Find local minima
local_mins <- which(diff(sign(diff(dens$y))) == 2) + 1
min_x_values <- dens$x[local_mins]
min_y_values <- dens$y[local_mins]
#Add vertical lines and points at the local minima
abline(v = min_x_values, col = "red", lty = 2)
points(min_x_values, min_y_values, col = "red", pch = 19)

#Print the local minimum x-values
min_x_values
## [1] 16.37088
#Convert Indus to a categorical variable.
cincinnati_clean$IndusCat <- ifelse(cincinnati_clean$Indus < min_x_values, "Low", "High")
cincinnati_clean$IndusCat <- factor(cincinnati_clean$IndusCat, levels = c("Low", "High"))
#ggpairs(data = cincinnati_clean, aes(color = IndusCat, alpha = 0.6))

#CrimeRateCat
cincinnati_clean$CrimeRateCat <- cut(
  cincinnati_clean$CrimeRate,
  breaks = c(0, 1, 5, Inf),    # Define breakpoints for "low", "medium", "high"
  labels = c("Low", "Medium", "High")      # Labels for each category
)
cincinnati_clean$CrimeRateCat <- factor(cincinnati_clean$CrimeRateCat, levels = c("Low", "Medium", "High"))
#ggpairs(cincinnati_clean, aes(color=cincinnati_clean$CrimeRateCat, alpha=0.5))

#create binary variables for categorical variable
cincinnati_clean$IsLowCrime <- ifelse(cincinnati_clean$CrimeRateCat == "Low", 1, 0)
cincinnati_clean$IsMediumCrime <- ifelse(cincinnati_clean$CrimeRateCat == "Medium", 1, 0)
cincinnati_clean$IsHighCrime <- ifelse(cincinnati_clean$CrimeRateCat == "High", 1, 0)
#train and test Set
set.seed(123)
sample <- sample(c(TRUE, FALSE), nrow(cincinnati_clean), replace=TRUE, prob=c(0.7,0.3))
train_clean  <- cincinnati_clean[sample, ]
test_clean   <- cincinnati_clean[!sample, ]
full_model3 <- lm(MedPrice ~ CrimeRateCat + IndusCat + River + AvgRoom + Age + Tax + PTRatio + log(LStat), data = train_clean)
summary(full_model3)
## 
## Call:
## lm(formula = MedPrice ~ CrimeRateCat + IndusCat + River + AvgRoom + 
##     Age + Tax + PTRatio + log(LStat), data = train_clean)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.0524 -2.0463 -0.7214  2.1495 11.4016 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        -15.301608  11.694389  -1.308  0.19494    
## CrimeRateCatMedium  -1.210951   1.942831  -0.623  0.53509    
## IndusCatHigh         1.289027   1.884267   0.684  0.49614    
## River               -0.581074   1.882440  -0.309  0.75847    
## AvgRoom              9.160766   1.214329   7.544 1.17e-10 ***
## Age                 -0.029340   0.022859  -1.284  0.20349    
## Tax                 -0.021631   0.006402  -3.379  0.00118 ** 
## PTRatio             -0.516995   0.255485  -2.024  0.04678 *  
## log(LStat)          -0.439619   1.741898  -0.252  0.80148    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.64 on 71 degrees of freedom
## Multiple R-squared:  0.8202, Adjusted R-squared:  0.7999 
## F-statistic: 40.48 on 8 and 71 DF,  p-value: < 2.2e-16
par(mfrow = c(2,2))
plot(full_model3)

ad.test(full_model3$residuals)
## 
##  Anderson-Darling normality test
## 
## data:  full_model3$residuals
## A = 0.52605, p-value = 0.1747
full_model4 <- lm(MedPrice ~ CrimeRateCat + IndusCat + River + AvgRoom + Age + Tax + PTRatio + I(1/LStat), data = train_clean)
summary(full_model4)
## 
## Call:
## lm(formula = MedPrice ~ CrimeRateCat + IndusCat + River + AvgRoom + 
##     Age + Tax + PTRatio + I(1/LStat), data = train_clean)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.6036 -2.1352 -0.3135  2.1639 11.6091 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        -10.081151   8.547681  -1.179  0.24218    
## CrimeRateCatMedium  -1.419221   1.895672  -0.749  0.45653    
## IndusCatHigh         1.285426   1.829310   0.703  0.48455    
## River               -0.961953   1.843444  -0.522  0.60342    
## AvgRoom              7.658635   1.140155   6.717 3.86e-09 ***
## Age                 -0.011516   0.020463  -0.563  0.57538    
## Tax                 -0.020728   0.006217  -3.334  0.00136 ** 
## PTRatio             -0.550077   0.249258  -2.207  0.03056 *  
## I(1/LStat)          20.064800  10.272092   1.953  0.05472 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.547 on 71 degrees of freedom
## Multiple R-squared:  0.8292, Adjusted R-squared:   0.81 
## F-statistic: 43.09 on 8 and 71 DF,  p-value: < 2.2e-16
plot(full_model4)

ad.test(full_model4$residuals)
## 
##  Anderson-Darling normality test
## 
## data:  full_model4$residuals
## A = 0.32316, p-value = 0.5207
# second model (fullmodel 4 ) is performing better in this case as well.Doing bestsubset on full_model4 now 
best_subset_model_clean <- regsubsets(MedPrice ~ CrimeRateCat + IndusCat + River + AvgRoom + Age + Tax + PTRatio + I(1/LStat), data = train_clean, nbest = 2)
## Warning in leaps.setup(x, y, wt = wt, nbest = nbest, nvmax = nvmax, force.in =
## force.in, : 1 linear dependencies found
## Reordering variables and trying again:
best_subset_summary_clean <- summary(best_subset_model_clean)

#Add RSS, Adjusted R², Cp, and BIC, MSE, etc to the same data frame
mse_values <- best_subset_summary_clean$rss / (nrow(train) - 1:nrow(best_subset_summary_clean$which))
mse_values <- round(mse_values, 2)

results_clean <- data.frame(
  NumVar = 1:nrow(best_subset_summary_clean$which),
  best_subset_summary_clean$outmat,
  Rsq = best_subset_summary_clean$rsq,
  RSS = best_subset_summary_clean$rss,
  AdjR2 = best_subset_summary_clean$adjr2,
  Cp = best_subset_summary_clean$cp,
  BIC = best_subset_summary_clean$bic,
  MSE = mse_values
)
results_clean$Rsq <- round(results_clean$Rsq, 2)
results_clean$RSS <- round(results_clean$RSS, 2)
results_clean$AdjR2 <- round(results_clean$AdjR2, 2)
results_clean$Cp <- round(results_clean$Cp, 2)
results_clean$BIC <- round(results_clean$BIC, 2)


print(results_clean)
##          NumVar CrimeRateCatMedium CrimeRateCatHigh IndusCatHigh River AvgRoom
## 1  ( 1 )      1                                                              *
## 1  ( 2 )      2                                                               
## 2  ( 1 )      3                                                              *
## 2  ( 2 )      4                                                              *
## 3  ( 1 )      5                                                              *
## 3  ( 2 )      6                                                              *
## 4  ( 1 )      7                                                              *
## 4  ( 2 )      8                                                              *
## 5  ( 1 )      9                                                              *
## 5  ( 2 )     10                  *                                           *
## 6  ( 1 )     11                  *                                   *       *
## 6  ( 2 )     12                  *                                           *
## 7  ( 1 )     13                  *                             *             *
## 7  ( 2 )     14                  *                             *     *       *
## 8  ( 1 )     15                  *                             *     *       *
## 8  ( 2 )     16                  *                *            *             *
##          Age Tax PTRatio I.1.LStat.  Rsq     RSS AdjR2    Cp     BIC   MSE
## 1  ( 1 )                            0.75 1330.25  0.74 28.24 -100.77 12.67
## 1  ( 2 )                          * 0.64 1888.69  0.63 72.00  -72.73 18.16
## 2  ( 1 )       *                    0.79 1110.74  0.78 13.04 -110.82 10.78
## 2  ( 2 )                          * 0.78 1149.95  0.77 16.11 -108.04 11.27
## 3  ( 1 )       *                  * 0.81  986.64  0.80  5.31 -115.91  9.77
## 3  ( 2 )   *   *                    0.81 1006.49  0.80  6.87 -114.32 10.06
## 4  ( 1 )       *       *          * 0.83  914.90  0.82  1.69 -117.57  9.24
## 4  ( 2 )   *   *       *            0.82  950.14  0.81  4.45 -114.54  9.70
## 5  ( 1 )   *   *       *          * 0.83  906.23  0.82  3.01 -113.95  9.34
## 5  ( 2 )       *       *          * 0.83  906.39  0.82  3.02 -113.93  9.44
## 6  ( 1 )       *       *          * 0.83  901.98  0.81  4.68 -109.94  9.49
## 6  ( 2 )   *   *       *          * 0.83  902.29  0.81  4.70 -109.91  9.60
## 7  ( 1 )   *   *       *          * 0.83  896.75  0.81  6.27 -106.03  9.64
## 7  ( 2 )       *       *          * 0.83  897.31  0.81  6.31 -105.98  9.75
## 8  ( 1 )   *   *       *          * 0.83  893.32  0.81  8.00 -101.95  9.82
## 8  ( 2 )   *   *       *          * 0.83  896.75  0.81  8.27 -101.64  9.96
# four plots for comparison

p1<-ggplot(data = results_clean, aes(x = as.factor(NumVar), y = MSE))+
  geom_point(color="black", size=2) +ylab("MSE") + xlab("No of variables")
  

p2<-ggplot(data = results_clean, aes(x = as.factor(NumVar), y = AdjR2)) +
  geom_point(color="black", size=2) +ylab("Adjusted RSQ") + xlab("No of variables")+
  geom_point(data = results_clean[which.max(results_clean$AdjR2), ], color="red", 
             size=3) 

p3<-ggplot(data = results_clean, aes(x = as.factor(NumVar), y = Cp)) +
  geom_point(color="black", size=2) +ylab("Cp") + xlab("No of variables")+
  geom_point(data = results_clean[which.min(results_clean$Cp), ], color="red", 
             size=3) 

p4<-ggplot(data = results_clean, aes(x = as.factor(NumVar), y = BIC)) +
  geom_point(color="black", size=2) +ylab("BIC") + xlab("No of variables")+
  geom_point(data = results_clean[which.min(results_clean$BIC), ], color="red", 
             size=3) 

grid.arrange(p1,p2,p3,p4,nrow=2)

#Findings: Model 7 is the winner in all cases but in model 7 , predictors has changed
# CrimeRate is dropped and Tax is added as the new predictros keeping othe predictors as the same (AVGRooom, PTRatio, 1/LSAt) ,getting model 7 

coef(best_subset_model_clean, 7)
##      (Intercept)              Age          PTRatio       I(1/LStat) 
##      26.69642489       0.02589852      -0.81653124      79.84524930 
## CrimeRateCatHigh 
##       0.00000000
model7_clean <- lm(MedPrice ~ AvgRoom + Tax +  PTRatio + I(1/LStat), data=train_clean)
summary(model7_clean)
## 
## Call:
## lm(formula = MedPrice ~ AvgRoom + Tax + PTRatio + I(1/LStat), 
##     data = train_clean)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.8828 -1.9872 -0.0568  2.1489 11.6709 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -10.439828   7.802814  -1.338 0.184953    
## AvgRoom       7.498197   1.022491   7.333 2.19e-10 ***
## Tax          -0.020854   0.005742  -3.632 0.000512 ***
## PTRatio      -0.538243   0.221944  -2.425 0.017709 *  
## I(1/LStat)   23.676669   8.353881   2.834 0.005899 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.493 on 75 degrees of freedom
## Multiple R-squared:  0.8251, Adjusted R-squared:  0.8158 
## F-statistic: 88.45 on 4 and 75 DF,  p-value: < 2.2e-16
par(mfrow = c(2,2))
plot(model7_clean)

ad.test(model7_clean$residuals)
## 
##  Anderson-Darling normality test
## 
## data:  model7_clean$residuals
## A = 0.20647, p-value = 0.8648
anova(model7_clean)
## Analysis of Variance Table
## 
## Response: MedPrice
##            Df Sum Sq Mean Sq  F value    Pr(>F)    
## AvgRoom     1 3900.4  3900.4 319.7439 < 2.2e-16 ***
## Tax         1  219.5   219.5  17.9948 6.255e-05 ***
## PTRatio     1   97.9    97.9   8.0215  0.005933 ** 
## I(1/LStat)  1   98.0    98.0   8.0328  0.005899 ** 
## Residuals  75  914.9    12.2                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
deviance(model7_clean)
## [1] 914.8952
vif(model7_clean)
##    AvgRoom        Tax    PTRatio I(1/LStat) 
##   2.767988   1.091269   1.177228   2.848722
#Findings : 82.5% of the variation is explained by these four predictors , better than before dropping tax,which was 78.5% 
#Linear Regression - Forward Selection

null_model_clean <- lm(MedPrice ~ 1, data = train_clean)
upper_model_clean <- formula(full_model4)
forward_model_clean <- step(null_model_clean, direction = "forward", scope = upper_model_clean)
## Start:  AIC=336.42
## MedPrice ~ 1
## 
##                Df Sum of Sq    RSS    AIC
## + AvgRoom       1    3900.4 1330.2 228.89
## + I(1/LStat)    1    3342.0 1888.7 256.93
## + PTRatio       1    1057.4 4173.2 320.35
## + Tax           1     865.4 4365.3 323.95
## + Age           1     861.9 4368.8 324.02
## + CrimeRateCat  1     514.1 4716.6 330.15
## + IndusCat      1     450.6 4780.1 331.21
## <none>                      5230.7 336.42
## + River         1      40.6 5190.1 337.80
## 
## Step:  AIC=228.89
## MedPrice ~ AvgRoom
## 
##                Df Sum of Sq    RSS    AIC
## + Tax           1   219.512 1110.7 216.46
## + I(1/LStat)    1   180.298 1150.0 219.24
## + Age           1   156.664 1173.6 220.86
## + PTRatio       1   107.848 1222.4 224.12
## + CrimeRateCat  1    60.860 1269.4 227.14
## + IndusCat      1    45.794 1284.5 228.09
## <none>                      1330.2 228.89
## + River         1     0.650 1329.6 230.85
## 
## Step:  AIC=216.46
## MedPrice ~ AvgRoom + Tax
## 
##                Df Sum of Sq     RSS    AIC
## + I(1/LStat)    1   124.097  986.64 208.98
## + Age           1   104.242 1006.49 210.58
## + PTRatio       1    97.851 1012.88 211.08
## <none>                      1110.74 216.46
## + CrimeRateCat  1    14.155 1096.58 217.43
## + IndusCat      1     2.870 1107.87 218.25
## + River         1     0.228 1110.51 218.44
## 
## Step:  AIC=208.98
## MedPrice ~ AvgRoom + Tax + I(1/LStat)
## 
##                Df Sum of Sq    RSS    AIC
## + PTRatio       1    71.743 914.90 204.94
## <none>                      986.64 208.98
## + Age           1    22.672 963.97 209.12
## + CrimeRateCat  1     4.792 981.85 210.59
## + IndusCat      1     0.187 986.45 210.97
## + River         1     0.050 986.59 210.98
## 
## Step:  AIC=204.94
## MedPrice ~ AvgRoom + Tax + I(1/LStat) + PTRatio
## 
##                Df Sum of Sq    RSS    AIC
## <none>                      914.90 204.94
## + Age           1    8.6694 906.23 206.18
## + CrimeRateCat  1    8.5067 906.39 206.19
## + River         1    7.8304 907.06 206.25
## + IndusCat      1    0.0039 914.89 206.94
coef(forward_model_clean)
##  (Intercept)      AvgRoom          Tax   I(1/LStat)      PTRatio 
## -10.43982834   7.49819680  -0.02085357  23.67666901  -0.53824336
summary(forward_model_clean)
## 
## Call:
## lm(formula = MedPrice ~ AvgRoom + Tax + I(1/LStat) + PTRatio, 
##     data = train_clean)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.8828 -1.9872 -0.0568  2.1489 11.6709 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -10.439828   7.802814  -1.338 0.184953    
## AvgRoom       7.498197   1.022491   7.333 2.19e-10 ***
## Tax          -0.020854   0.005742  -3.632 0.000512 ***
## I(1/LStat)   23.676669   8.353881   2.834 0.005899 ** 
## PTRatio      -0.538243   0.221944  -2.425 0.017709 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.493 on 75 degrees of freedom
## Multiple R-squared:  0.8251, Adjusted R-squared:  0.8158 
## F-statistic: 88.45 on 4 and 75 DF,  p-value: < 2.2e-16
par(mfrow = c(2,2))
plot(forward_model_clean)

ad.test(forward_model_clean$residuals)
## 
##  Anderson-Darling normality test
## 
## data:  forward_model_clean$residuals
## A = 0.20647, p-value = 0.8648
anova(forward_model_clean)
## Analysis of Variance Table
## 
## Response: MedPrice
##            Df Sum Sq Mean Sq  F value    Pr(>F)    
## AvgRoom     1 3900.4  3900.4 319.7439 < 2.2e-16 ***
## Tax         1  219.5   219.5  17.9948 6.255e-05 ***
## I(1/LStat)  1  124.1   124.1  10.1730  0.002081 ** 
## PTRatio     1   71.7    71.7   5.8813  0.017709 *  
## Residuals  75  914.9    12.2                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
deviance(forward_model_clean)
## [1] 914.8952
vif(forward_model_clean)
##    AvgRoom        Tax I(1/LStat)    PTRatio 
##   2.767988   1.091269   2.848722   1.177228
#Backward Selection
backward_model_clean <- step(full_model2, direction = "backward")
## Start:  AIC=293.64
## MedPrice ~ CrimeRateCat + IndusCat + River + AvgRoom + Age + 
##     Tax + PTRatio + I(1/LStat)
## 
##                Df Sum of Sq    RSS    AIC
## - Age           1      2.65 1403.7 291.85
## - Tax           1      2.79 1403.9 291.86
## - IndusCat      1      2.93 1404.0 291.87
## - River         1     12.12 1413.2 292.56
## <none>                      1401.1 293.64
## - PTRatio       1    109.78 1510.8 299.64
## - CrimeRateCat  2    144.00 1545.1 300.01
## - AvgRoom       1    329.76 1730.8 314.05
## - I(1/LStat)    1    358.34 1759.4 315.79
## 
## Step:  AIC=291.84
## MedPrice ~ CrimeRateCat + IndusCat + River + AvgRoom + Tax + 
##     PTRatio + I(1/LStat)
## 
##                Df Sum of Sq    RSS    AIC
## - Tax           1      4.22 1407.9 290.16
## - IndusCat      1      4.46 1408.2 290.18
## - River         1     10.89 1414.6 290.66
## <none>                      1403.7 291.84
## - PTRatio       1    109.12 1512.8 297.78
## - CrimeRateCat  2    141.36 1545.1 298.02
## - AvgRoom       1    390.26 1794.0 315.85
## - I(1/LStat)    1    417.23 1821.0 317.43
## 
## Step:  AIC=290.16
## MedPrice ~ CrimeRateCat + IndusCat + River + AvgRoom + PTRatio + 
##     I(1/LStat)
## 
##                Df Sum of Sq    RSS    AIC
## - IndusCat      1      2.16 1410.1 288.32
## - River         1     10.60 1418.5 288.96
## <none>                      1407.9 290.16
## - PTRatio       1    127.90 1535.8 297.38
## - CrimeRateCat  2    276.62 1684.6 305.18
## - AvgRoom       1    388.90 1796.8 314.02
## - I(1/LStat)    1    421.75 1829.7 315.94
## 
## Step:  AIC=288.33
## MedPrice ~ CrimeRateCat + River + AvgRoom + PTRatio + I(1/LStat)
## 
##                Df Sum of Sq    RSS    AIC
## - River         1     10.08 1420.2 287.08
## <none>                      1410.1 288.32
## - PTRatio       1    129.70 1539.8 295.65
## - CrimeRateCat  2    414.20 1824.3 311.62
## - AvgRoom       1    390.57 1800.7 312.24
## - I(1/LStat)    1    421.76 1831.8 314.06
## 
## Step:  AIC=287.08
## MedPrice ~ CrimeRateCat + AvgRoom + PTRatio + I(1/LStat)
## 
##                Df Sum of Sq    RSS    AIC
## <none>                      1420.2 287.08
## - PTRatio       1    121.95 1542.1 293.81
## - CrimeRateCat  2    409.31 1829.5 309.93
## - AvgRoom       1    400.15 1820.3 311.39
## - I(1/LStat)    1    420.46 1840.6 312.57
coef(backward_model_clean)
##        (Intercept) CrimeRateCatMedium   CrimeRateCatHigh            AvgRoom 
##          1.2525372         -0.6695093         -6.0447106          4.4965043 
##            PTRatio         I(1/LStat) 
##         -0.6133069         43.2887552
summary(backward_model_clean)
## 
## Call:
## lm(formula = MedPrice ~ CrimeRateCat + AvgRoom + PTRatio + I(1/LStat), 
##     data = train)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -9.081 -2.265  0.068  2.099 13.657 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          1.2525     6.8234   0.184  0.85473    
## CrimeRateCatMedium  -0.6695     1.1804  -0.567  0.57186    
## CrimeRateCatHigh    -6.0447     1.1389  -5.308 6.70e-07 ***
## AvgRoom              4.4965     0.8471   5.308 6.69e-07 ***
## PTRatio             -0.6133     0.2093  -2.930  0.00419 ** 
## I(1/LStat)          43.2888     7.9558   5.441 3.77e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.769 on 100 degrees of freedom
## Multiple R-squared:  0.7946, Adjusted R-squared:  0.7843 
## F-statistic: 77.36 on 5 and 100 DF,  p-value: < 2.2e-16
par(mfrow = c(2,2))
plot(backward_model_clean)

ad.test(backward_model_clean$residuals)
## 
##  Anderson-Darling normality test
## 
## data:  backward_model_clean$residuals
## A = 0.29576, p-value = 0.5886
anova(backward_model_clean)
## Analysis of Variance Table
## 
## Response: MedPrice
##               Df  Sum Sq Mean Sq F value    Pr(>F)    
## CrimeRateCat   2 2233.48 1116.74  78.634 < 2.2e-16 ***
## AvgRoom        1 2624.84 2624.84 184.826 < 2.2e-16 ***
## PTRatio        1  214.14  214.14  15.078 0.0001853 ***
## I(1/LStat)     1  420.46  420.46  29.606 3.774e-07 ***
## Residuals    100 1420.17   14.20                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
deviance(backward_model_clean)
## [1] 1420.171
vif(backward_model_clean)
##                  GVIF Df GVIF^(1/(2*Df))
## CrimeRateCat 1.501108  2        1.106886
## AvgRoom      1.974105  1        1.405029
## PTRatio      1.423441  1        1.193081
## I(1/LStat)   2.430542  1        1.559020
#Findings: Both Forward and Backward selection produced the exact ouput as model 7 that we got from best subset.

Analysis

When we dropped the Tax at 666 and did the building process again, we found that the model it picks is the same which is model 7 with 4 predictors but the predictors has changed where CrimeRate is dropped and Tax is added as the predictor where AvgRoom , PTRatio and 1/LSat is the same. The model with the tax dropped is slightly better as the adjR^2 is 3% more in the model where we have dropped the tax outliers.

#Notes 4/16/2025 Start with focus on demography. Figure out how much of the houses are on the river. See if there are not many on the river, like 6 or so, if so, they may be outliers and can be dropped. We only have 7 houses on the river, these might destory our analysis. Make a boxplot to check for outliers. If so, check to see if those prices are riverfront. If so, drop them. If not, keep them in.

Next, check age. We have a range from 3-100. Google and see what the cutoffs are, break it into categories.

Plot only y = B0 + B1x1. Make a bunch of residual plots to see if the model looks good. If several have a non constant variance , transform y variable using ln(y). Then, check to see if ln(y) = B0 + B1x1 If we see a curve indicating non linearity, add x^2 or whatever to give y = B0 + B1x1 + B2X^2. The VIF here will be huge but that doesn’t matter.

Everytime we do changes, we need to fit and check the residuals and see if it improves.

Once we decide on all the transformations, we start building. Not just best subsets, but also forward and backaward selection. Check if these models are in the best subset.

To check interactions with categorical variables, we would make a seperate chart for each option in the categorical variable to check each option to see if it interacted. In our case, plot Riverfront x1 and x2 vs medPrice.

It is very hard to find interaction in quantitative varables. We can cateorgize it and make plots out of it to check.